Abstract — As information systems become more open to the Internet, the importance of secure networks has increased
tremendously. New intelligent Intrusion Detection Systems (IDSs) that are based on sophisticated algorithms rather than
current signature-based detection are in demand. An installed IDS often needs to be updated because of new attack
methods or upgraded computing environments. Since many current IDSs are constructed by manual encoding of expert
knowledge, changes to them are expensive and slow. In a data mining-based intrusion detection system, domain knowledge
about intrusion detection should be used to efficiently extract relevant rules from large amounts of records. This paper
proposes a new ensemble boosted decision tree approach for intrusion detection systems. Experimental results show better
intrusion detection than other existing methods.

Index Terms — boosted decision trees, data mining, ensemble approach, network intrusion detection system 

I. INTRODUCTION 

Network technologies have been widely adopted and have developed quickly in recent years, providing new living
and shopping experiences, particularly in the fields of e-business, e-learning and e-money. But along with network
development has come a huge increase in network crime. It not only greatly affects everyday life, which relies
heavily on networks and Internet technologies, but also damages the computer systems that serve daily activities, including
business, learning, entertainment and so on. Besides this, internal hacking is difficult to detect because firewalls and
Intrusion Detection Systems usually defend only against outside attacks. An Intrusion Detection System (IDS) [3] is an
important countermeasure used to preserve data integrity and system availability against attacks. An IDS is a
combination of software and hardware that attempts to perform intrusion detection. Intrusion detection is the process of
monitoring events, gathering intrusion-related knowledge from them, and analyzing them for signs of intrusion. The IDS
raises an alarm when a possible intrusion occurs in the system. The network data source for intrusion detection consists
of a large amount of textual information, which is difficult to comprehend and analyze.
Many IDSs can be described with three fundamental functional components: Information Source, Analysis, and Response.
Different sources of information, and events based on that information, are gathered to decide whether an intrusion has
taken place. This information is gathered at various levels such as system, host and application. Based on analysis of
this data, intrusions can be detected using two common practices: misuse detection and anomaly detection. Misuse
detection is based on extensive knowledge of patterns associated with known attacks, provided by human experts. Pattern
matching, data mining, and state transition analysis are some of the approaches to misuse detection. Anomaly detection
is based on profiles that represent the normal behavior of users, hosts and networks, and detects attacks as significant
deviations from these profiles. Statistical methods and expert systems are some of the methods used for anomaly
detection. The main motivation for using data mining in intrusion detection [5, 10, 12, 13, 15, 18] is automation.
Patterns of normal behavior and patterns of intrusion can be computed using data mining. To apply data mining techniques
in intrusion detection, the collected monitoring data first needs to be preprocessed and converted to a format suitable
for mining. Next, the reformatted data is used to develop a clustering or classification model. The classification model
can be rule-based, decision-tree based, association-rule based, Bayesian-network based, or neural-network based.

Intrusion detection mechanisms based on data mining are not only automated but also provide significantly higher
accuracy and efficiency. Unlike manual techniques, data mining can check real-time network records systematically, which
reduces the chance that an intrusion is missed. Credibility is important in every system. IDSs are becoming an important
part of the security system, and their credibility adds value to the whole system. Data mining techniques can be applied
to gain insightful knowledge of intrusion prevention mechanisms. They can help detect new vulnerabilities and intrusions,
discover previously unknown patterns of attacker behavior, and provide decision support for intrusion management. The
rest of this paper is organized as follows. Section II explains data mining. Section III introduces the boosted decision
tree. The experiment and results are presented in Section IV, and Section V concludes the paper.

II. DATA MINING 

Data mining (DM), also called Knowledge Discovery and Data Mining, is one of the hot topics in the field of
knowledge extraction from databases. Data mining is used to automatically learn patterns from large quantities of data.
Mining can efficiently discover useful and interesting rules from large collections of data. It is a fairly recent topic in
computer science but uses many older computational techniques from statistics, information retrieval, machine learning
and pattern recognition. Data mining works to find the major relations within collections of data and makes it possible
to discover new and anomalous behavior. Data mining based intrusion detection techniques generally fall into one of two
categories: misuse detection and anomaly detection. In misuse detection, each instance in a data set is labeled as 'normal' or
'intrusion' and a learning algorithm is trained over the labeled data. These techniques are able to automatically retrain
intrusion detection models on different input data that include new types of attacks, as long as they have been labeled
appropriately. Data mining is used in fields such as marketing, finance and business in general, where it has proven
successful. The main data mining approaches used include classification, which maps a data item into one of several
predefined categories. This approach normally outputs "classifiers", for example in the form of decision trees or rules,
that can classify new data in the future.

An ideal application in intrusion detection is to gather sufficient "normal" and "abnormal" audit data for a
user or a program. The second important approach is clustering, which maps data items into groups according to the
similarity or distance between them. Anomaly detection techniques thus identify new types of intrusions as deviations from
normal usage [7, 8]. In statistics-based outlier detection techniques [4], the data points are modeled using a stochastic
distribution, and points are determined to be outliers depending on their relationship with this model. However, with
increasing dimensionality, it becomes increasingly difficult and inaccurate to estimate the multidimensional distributions
of the data points [1]. In contrast, the recent outlier detection algorithms that we utilize in this study are based on
computing the full dimensional distances of the points from one another [9, 16] as well as on computing the densities of
local neighborhoods [6]. Classifier construction is another important research challenge in building an efficient IDS.
Many data mining algorithms have become very popular for classifying intrusion detection datasets, such as decision
trees, the naive Bayesian classifier, neural networks, genetic algorithms, and support vector machines. However, the
classification accuracy of most existing data mining algorithms needs to be improved, because it is very difficult to
detect new attacks while attackers are continuously changing their attack patterns. Anomaly-based network intrusion
detection models are now being used to detect new attacks, but their false positives are usually very high. The
performance of an intrusion detection model depends on its detection rate (DR) and false positives (FP). Ensemble
approaches [14, 17] have the advantage that they can be made to adapt to changes in the stream more accurately than single
model techniques. Several ensemble approaches have been proposed for the classification of evolving data streams. An
ensemble classifier combines several base models, supports continuous learning, and typically achieves better accuracy
than a single classification technique. Bagging and boosting are two of the most well-known ensemble
learning methods due to their theoretical performance guarantees and strong experimental results. Boosting has attracted
much attention in the machine learning community as well as in statistics, mainly because of its excellent performance and
computational attractiveness for large datasets.

III. BOOSTED DECISION TREE 

The proposed model uses a boosted decision tree, i.e. the Hoeffding tree classification technique, to increase the
performance of the intrusion detection system. The underlying idea of boosting is to combine simple
rules to form an ensemble such that the performance of the single ensemble member is improved, i.e. boosted. Let
$h_1, h_2, \ldots, h_N$ be a set of hypotheses and consider the composite ensemble hypothesis

$$f(x) = \sum_{n=1}^{N} \alpha_n h_n(x).$$

Here $\alpha_n$ denotes the coefficient with which the ensemble member $h_n$ is combined; both $\alpha_n$ and the learner
or hypothesis $h_n$ are to be learned within the boosting procedure.

The boosting algorithm starts by giving all training tuples the same weight $w_0$. After a classifier is built, the
weight of each tuple is changed according to the classification given by that classifier. Then, a second classifier is built using
the reweighted training tuples. The final classification for intrusion detection is a weighted average of the individual
classifications over all classifiers. There are several methods to update the weights and combine the individual classifiers.

After the $k$th decision tree is built, the total misclassification error $\varepsilon_k$ of the tree, defined as the sum
of the weights of misclassified tuples over the sum of the weights of all tuples, is calculated:

$$\varepsilon_k = \frac{\sum_{i \,:\, \text{tuple } i \text{ misclassified}} w_i}{\sum_{i} w_i},$$

where $w_i$ is the current weight of tuple $i$. The tuple weights are then updated and tree $k+1$ is constructed. Note
that, as the algorithm progresses, the predominance of hard-to-classify instances in the
training set is increased. The final classification of tuple $i$ is a weighted sum of the classifications over the individual trees.
Furthermore, trees with lower misclassification errors $\varepsilon_k$ are given more weight when the final classification is computed.

In a decision tree, i.e. a Hoeffding tree, each node contains a test on an attribute, each branch from a node corresponds
to a possible outcome of the test, and each leaf contains a class prediction. A decision tree is learned by recursively
replacing leaves by test nodes, starting at the root. The attribute to test at a node is chosen by comparing all the
available attributes and choosing the best one.
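The reweighting loop described in this section can be sketched as follows. This is only an illustration under stated
assumptions: the text notes that several weight-update rules exist and does not fix one, so the sketch assumes the common
AdaBoost-style rule with coefficient 0.5 * ln((1 - eps) / eps); it also uses one-level threshold trees (stumps) instead of
Hoeffding trees, and labels in {-1, +1}. The names `train_stump`, `boost` and `predict` are hypothetical.

```python
import math

def train_stump(X, y, w):
    """Return (error, feature, threshold, polarity) of the stump with the lowest weighted error."""
    best = None
    for f in range(len(X[0])):
        for t in sorted({row[f] for row in X}):
            for pol in (1, -1):
                pred = [pol if row[f] <= t else -pol for row in X]
                # Weighted error: weights of misclassified tuples over weights of all tuples.
                err = sum(wi for wi, p, yi in zip(w, pred, y) if p != yi) / sum(w)
                if best is None or err < best[0]:
                    best = (err, f, t, pol)
    return best

def stump_predict(stump, row):
    _, f, t, pol = stump
    return pol if row[f] <= t else -pol

def boost(X, y, rounds=5):
    w = [1.0 / len(X)] * len(X)          # all tuples start with the same weight
    ensemble = []
    for _ in range(rounds):
        stump = train_stump(X, y, w)
        eps = stump[0]
        if eps >= 0.5:                   # no better than chance: stop adding trees
            break
        eps = max(eps, 1e-10)            # avoid division by zero for a perfect tree
        alpha = 0.5 * math.log((1.0 - eps) / eps)   # assumed AdaBoost-style coefficient
        ensemble.append((alpha, stump))
        # Misclassified tuples gain weight, correctly classified tuples lose weight.
        w = [wi * math.exp(-alpha * yi * stump_predict(stump, row))
             for wi, row, yi in zip(w, X, y)]
        total = sum(w)
        w = [wi / total for wi in w]
    return ensemble

def predict(ensemble, row):
    """Weighted vote: trees with lower error carry a larger coefficient."""
    score = sum(alpha * stump_predict(stump, row) for alpha, stump in ensemble)
    return 1 if score >= 0 else -1
```

Calling `boost(X, y, rounds=3)` on a small one-feature data set and then `predict(ensemble, row)` for each row shows how
tuples missed by an early tree dominate the training of the later ones.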

For classifying examples in the dataset, the prior and conditional probabilities generated from the dataset are used
to make the prediction. This is done by combining the effects of the different attribute values from the example. Suppose
the example $e_i$ has independent attribute values $\{a_{i1}, a_{i2}, \ldots, a_{ip}\}$. We know $P(a_{ik} \mid c_j)$ for
each class $c_j$ and attribute value $a_{ik}$, and then estimate

$$P(e_i \mid c_j) = \prod_{k=1}^{p} P(a_{ik} \mid c_j).$$
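A minimal sketch of this estimate, assuming categorical attributes and plain frequency counts without smoothing (the
text does not say how zero counts are handled). The likelihood is the product of the per-attribute conditional
probabilities; multiplying by the class prior and normalizing gives the posterior used to pick a class. The function names
and the toy records are hypothetical.

```python
from collections import Counter, defaultdict

def fit(examples, labels):
    """Count class frequencies and per-class attribute value frequencies."""
    class_counts = Counter(labels)
    value_counts = defaultdict(Counter)      # (class, attribute index) -> value counts
    for attrs, c in zip(examples, labels):
        for k, v in enumerate(attrs):
            value_counts[(c, k)][v] += 1
    return class_counts, value_counts

def posterior(model, attrs):
    """Return P(c_j | e_i) for every class, under the independence assumption."""
    class_counts, value_counts = model
    n = sum(class_counts.values())
    scores = {}
    for c, nc in class_counts.items():
        p = nc / n                            # prior P(c_j)
        for k, v in enumerate(attrs):
            p *= value_counts[(c, k)][v] / nc  # conditional P(a_ik | c_j)
        scores[c] = p
    total = sum(scores.values())
    return {c: (s / total if total else 0.0) for c, s in scores.items()}

def classify(model, attrs):
    post = posterior(model, attrs)
    return max(post, key=post.get)

# Toy records: (protocol type, service), invented for illustration only.
examples = [("tcp", "http"), ("tcp", "http"), ("udp", "domain"),
            ("icmp", "ecr_i"), ("icmp", "ecr_i"), ("tcp", "private")]
labels = ["normal", "normal", "normal", "attack", "attack", "attack"]
model = fit(examples, labels)
```

With this toy model, `classify(model, ("icmp", "ecr_i"))` selects the class whose prior times likelihood is largest.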

To classify an example in the dataset, the algorithm estimates the likelihood that $e_i$ is in each class. The probability
that $e_i$ is in a class is the product of the conditional probabilities for each attribute value and the prior probability
for that class. The posterior probability $P(c_j \mid e_i)$ is then found for each class, and the example is assigned to the
class with the highest posterior probability. The algorithm continues this process until all the examples of the
sub-datasets or sub-sub-datasets are correctly classified. When the algorithm correctly classifies all the examples of all
sub- or sub-sub-datasets, it terminates, and the prior and conditional probabilities for each sub- or sub-sub-dataset are
preserved for future classification of unseen examples.

In the proposed scheme, the boosting method improves ensemble performance by using an adaptive window and an adaptive-size
Hoeffding tree as the base learner. Because of this, the algorithm works faster and performance increases. It uses a
dynamic sample weight assignment technique. The adaptive sliding window is parameter-free and assumption-free in the sense
that it automatically detects and adapts to the current rate of change. Its only parameter is a confidence bound
$\delta$. The window is not maintained explicitly but is compressed using a variant of the exponential histogram
technique. It keeps a window of length $W$ using only $O(\log W)$ memory and $O(\log W)$ processing time per item, rather
than the $O(W)$ one expects from a naive implementation. It is used as a change detector, since it shrinks the window if
and only if there has been a significant change in recent examples, and as an estimator for the current average of the
sequence it is reading, since, with high probability, older parts of the window with a significantly different average are
automatically dropped.
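The behavior of such an adaptive window can be illustrated with a deliberately naive sketch. It stores the window
explicitly and checks every split point, so it costs O(W) memory and time per check rather than the O(log W) of the
exponential histogram variant described above. The cut threshold used here (a Hoeffding-style bound on the harmonic mean
of the two sub-window lengths, with delta divided by the window length) is an assumption taken from the usual description
of adaptive windowing, not from this paper, and inputs are assumed to lie in [0, 1]. The class name is hypothetical.

```python
import math
from collections import deque

class SimpleAdaptiveWindow:
    """Naive adaptive sliding window with a single confidence parameter delta."""

    def __init__(self, delta=0.002):
        self.delta = delta
        self.window = deque()

    def add(self, x):
        """Append x (assumed in [0, 1]); return True if old items were dropped."""
        self.window.append(x)
        shrunk = False
        while self._drop_oldest_if_changed():
            shrunk = True
        return shrunk

    def _drop_oldest_if_changed(self):
        n = len(self.window)
        if n < 2:
            return False
        total = sum(self.window)
        delta_prime = self.delta / n
        head_sum = 0.0
        for n0, x in enumerate(self.window, start=1):
            if n0 == n:
                break
            head_sum += x
            n1 = n - n0
            m = 1.0 / (1.0 / n0 + 1.0 / n1)          # harmonic mean of the two lengths
            eps_cut = math.sqrt(math.log(4.0 / delta_prime) / (2.0 * m))
            # Compare the average of the older part with the average of the newer part.
            if abs(head_sum / n0 - (total - head_sum) / n1) >= eps_cut:
                self.window.popleft()                 # drop the oldest item and re-check
                return True
        return False

    def mean(self):
        return sum(self.window) / len(self.window) if self.window else 0.0
```

Feeding a long run of one value followed by a run of a clearly different value makes `add` return `True` shortly after the
change, after which `mean()` tracks the new level.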

IV. EXPERIMENT AND RESULT 

The proposed boosted decision trees algorithm is tested on the KDDCup'99 dataset [11] and compared with
Naive Bayes, kNN, eClass0 [2], eClass1 [2] and the Winner (KDDCup'99).

A. Evaluation of Anomaly Detection 

There are generally two types of attacks in network intrusion detection: attacks that involve single connections
and attacks that involve multiple connections (bursts of connections). The standard metrics in Table I treat all types of
attacks similarly, and thus fail to provide a sufficiently generic and systematic evaluation for attacks that involve many
network connections.

TABLE I. CONFUSION MATRIX FOR EVALUATION OF INTRUSION DETECTION

Interleaved Test-Then-Train - In this method, each individual example is used to test the model before it
is used for training, and from this the accuracy can be incrementally updated. The intention behind this method is that
the model is always tested on examples it has not seen. Its advantage over the holdout method is that no holdout set is
needed for testing, and it ensures a smooth plot of accuracy over time, as each individual example becomes increasingly
less significant to the overall average.

B. Evaluation on Kddcup'99 Data Set 

The experiment is carried out on a real intrusion detection data stream that was used in the Knowledge
Discovery and Data Mining (KDD) 1999 Cup competition. In the KDD99 dataset, the input data flow contains the details of
network connections, such as protocol type, connection duration and login type. Each data sample in the KDD99 dataset
holds the attribute values of one connection in the network data flow, and each sample is labeled either as normal or as
an attack with exactly one specific attack type. In total, 41 features are used in the KDD99 dataset, and each connection
can be categorized into one of five main classes: one normal class and four main intrusion classes, DOS, U2R, R2L and
Probe. There are 22 different types of attacks, grouped into the four main types DOS, U2R, R2L and Probe, tabulated in
Table

The experimental setting for the KDD99 Cup takes 10% of the whole real raw data stream (494021 data samples), and 12
features are selected as per the proposed algorithm. Figures 1(a) - 1(e) show a graphical comparison of the boosted
decision trees algorithm with the Winner (KDDCup'99), eClass0, eClass1, kNN, C4.5 and Naive Bayes in terms of accuracy or
detection rate.
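The grouping of the 22 attack types into the four main classes can be written down as a lookup table. The sketch below
uses the attack names as they commonly appear in the labels of the public KDD Cup 1999 training files, where each label
ends with a period; both the exact spelling and the helper name `category` are assumptions to be checked against the data
actually used.

```python
from collections import Counter

# Attack type -> main class, following the commonly cited KDD Cup 1999 grouping.
ATTACK_CATEGORY = {
    "back": "DOS", "land": "DOS", "neptune": "DOS",
    "pod": "DOS", "smurf": "DOS", "teardrop": "DOS",
    "ipsweep": "Probe", "nmap": "Probe", "portsweep": "Probe", "satan": "Probe",
    "ftp_write": "R2L", "guess_passwd": "R2L", "imap": "R2L", "multihop": "R2L",
    "phf": "R2L", "spy": "R2L", "warezclient": "R2L", "warezmaster": "R2L",
    "buffer_overflow": "U2R", "loadmodule": "U2R", "perl": "U2R", "rootkit": "U2R",
}

def category(label):
    """Map a raw connection label such as 'smurf.' to one of the five main classes."""
    name = label.strip().rstrip(".")
    if name == "normal":
        return "normal"
    return ATTACK_CATEGORY[name]   # raises KeyError for an unknown attack type

# Number of attack types per main class.
types_per_class = Counter(ATTACK_CATEGORY.values())
```

Applying `category` to the last field of each record turns the 23 raw labels into the five-class problem described above.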

V. CONCLUSION 

Our model employs feature selection so that the binary classifier for each type of attack can be more accurate,
which improves the detection of attacks that occur less frequently in the training data. Based on these accurate binary
classifiers, our model applies a new ensemble approach that aggregates each binary classifier's decision for the same
input and decides which class is most suitable for a given input. During this process, the potential bias of a given binary
classifier can be alleviated by the other binary classifiers' decisions. Our model also makes use of multi-boosting to
reduce both variance and bias. Visualization presents the analyzed result in a different setting to further enhance the
analysis. The GUI in Java allows the tool to be used on different platforms. The tool is tested and demonstrated on
several real network datasets.

Further, we use a boosted decision tree, i.e. the Hoeffding tree classification technique, to increase the performance of
the intrusion detection system. We use a learning technique that combines several decision trees to form a classifier,
which is obtained from a weighted majority vote of the classifications given by the individual trees. The generalization
accuracy of boosted decision trees is compared with Naive Bayes, k-NN, eClass0, eClass1 and the Winner. A decision tree is
learned by recursively replacing leaves by test nodes, starting at the root. The attribute to test at a node is chosen by
comparing all the available attributes and choosing the best one. Apache Tomcat Server is used to regularly find every
new attack on the different PCs of the LAN, evaluate the attack, and update the PCs accordingly on a day-to-day basis.
Therefore, by implementing the new ensemble boosted decision tree approach for a data mining based Intrusion Detection
System, we can trace all the features of the system in a particular LAN to maintain a good network and monitor the
performance of the system, thereby providing enhanced network management. Experimental results show better intrusion
detection than other existing methods.